---
title: "PR5 DEPLOYMENT"
output: html_notebook
---

# Deployment: Model-agnostic methods

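```{r}
library(randomForest)
library(dplyr)
library(lubridate)   # as_date(), ymd(), interval(), int_length()
library(ggplot2)     # ggplot(), geom_tile(), geom_rug()
library(mltools)
library(data.table)
library(pdp)
library(plotly)
```

```{r}
horas <- read.csv('hour.csv')
days <- read.csv('day.csv')   # the chunks below refer to this data frame as `days`
house <- read.csv('kc_house_data.csv')
```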

### **1.- One dimensional Partial Dependence Plot.**

The partial dependence plot shows the marginal effect of a feature on the predicted outcome of a previously fit model.
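
As a minimal sketch of what this definition means in practice, the partial dependence of a feature can be computed by hand: fix the feature at a grid value in every row, predict, and average. The toy data, the linear model and the helper `pd_1d` below are illustrative assumptions and not part of the exercise; the `pdp` package used later performs the same kind of averaging for a fitted model.

```r
# Hypothetical toy data: y depends linearly on x1 and x2 (no noise).
set.seed(1)
toy <- data.frame(x1 = runif(200, 0, 10), x2 = runif(200, 0, 5))
toy$y <- 2 * toy$x1 + 3 * toy$x2

toy_fit <- lm(y ~ x1 + x2, data = toy)

# Partial dependence of `feature`: for each grid value, overwrite the
# feature in every row, predict, and average the predictions.
pd_1d <- function(model, data, feature, grid) {
  sapply(grid, function(v) {
    data[[feature]] <- v
    mean(predict(model, newdata = data))
  })
}

grid_x1 <- seq(0, 10, by = 2)
pd_x1 <- pd_1d(toy_fit, toy, "x1", grid_x1)

# For this additive linear model the curve is a straight line with slope 2.
print(data.frame(x1 = grid_x1, avg_prediction = pd_x1))
```

Because the toy model is additive and linear, the averaged prediction changes by exactly the coefficient of `x1` per unit step; for a random forest the same procedure yields a non-linear curve.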

Apply PDP to the regression example of predicting bike rentals. Fit a random forest approximation for the prediction of bike rentals (**cnt**). Use the partial dependence plot to visualize the relationships the model learned. Use the slides shown in class as a model.

```{r}
# Load the data: convert dteday from character to Date
days$dteday <- as_date(days$dteday)

# Select the variables for the model
datmod <- select(days, workingday, holiday, temp, hum, windspeed, cnt)
datmod$days_since_2011 <- int_length(interval(ymd("2011-01-01"), days$dteday)) / (3600*24)

# Create the new indicator variables for the model
datmod$winter <- ifelse(days$season == 1, 1, 0)
datmod$summer <- ifelse(days$season == 3, 1, 0)
datmod$fall <- ifelse(days$season == 4, 1, 0)
datmod$MISTY <- ifelse(days$weathersit == 2, 1, 0)
datmod$RAIN <- ifelse(days$weathersit == 3 | days$weathersit == 4, 1, 0)

# Normalized variables make the model hard to interpret, because normalized
# values have no intuitive meaning. De-normalizing them lets us read the effect
# of each variable in its original units. In exchange, we lose the ability to
# compare variables directly, since each has its own original range and scale.
# t = t_norm * (t_max - t_min) + t_min
datmod$temp <- days$temp * (39 - (-8)) - 8
# hum_norm = hum / 100
datmod$hum <- days$hum * 100
# wind_norm = wind / 67
datmod$windspeed <- days$windspeed * 67

library(vip)

# Train the model
days_rf <- randomForest(cnt ~ ., data = datmod, importance = TRUE)

# One-dimensional PDP for each feature of interest
p1 <- partial(days_rf, pred.var = "days_since_2011", plot = TRUE, rug = TRUE, plot.engine = "ggplot2")
p2 <- partial(days_rf, pred.var = "temp", plot = TRUE, rug = TRUE, plot.engine = "ggplot2")
p3 <- partial(days_rf, pred.var = "hum", plot = TRUE, rug = TRUE, plot.engine = "ggplot2")
p4 <- partial(days_rf, pred.var = "windspeed", plot = TRUE, rug = TRUE, plot.engine = "ggplot2")

subplot(p1, p2, p3, p4, shareX = FALSE, titleX = TRUE)
```
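
The chunk above rescales `temp`, `hum` and `windspeed` back to their original units. The following sketch isolates that min-max inversion and checks it with a round trip. The helper names `normalize` and `denormalize` and the sample temperatures are assumptions for illustration; the constants (-8, 39, 100, 67) are the ones used in the chunk.

```r
# Invert min-max scaling: x = x_norm * (x_max - x_min) + x_min
denormalize <- function(x_norm, x_min, x_max) {
  x_norm * (x_max - x_min) + x_min
}

# Forward scaling, used here only to check the round trip.
normalize <- function(x, x_min, x_max) {
  (x - x_min) / (x_max - x_min)
}

temp_c <- c(-8, 0, 15.5, 39)            # hypothetical temperatures, degrees Celsius
temp_norm <- normalize(temp_c, -8, 39)
temp_back <- denormalize(temp_norm, -8, 39)
stopifnot(isTRUE(all.equal(temp_back, temp_c)))

# Humidity and wind speed were divided by their maxima (100 and 67),
# which is the same formula with x_min = 0.
hum_half <- denormalize(0.5, 0, 100)    # 50
wind_half <- denormalize(0.5, 0, 67)    # 33.5
print(c(hum_half, wind_half))
```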

Analyse the influence of **days since 2011**, **temperature**, **humidity** and **wind speed** on the predicted bike counts.

As time goes on, the model predicts an increase in the number of rented bicycles, which is plausible as the service becomes better known over time. For warm but not too hot weather, a large number of rented bikes is predicted; above about 27 ºC, however, the predicted number decreases (too much heat). Cyclists appear increasingly reluctant to rent a bike once humidity exceeds 60%. Finally, the windier it gets, the fewer people rent a bike, which is logical. Above roughly 25 km/h the prediction stays flat, maybe because there is little training data in that range.

### **2.- Bidimensional Partial Dependence Plot.**

Generate a 2D Partial Dependence Plot with humidity and temperature to predict the number of bikes rented depending on those parameters.

Show the density distribution of both input features with the 2D plot as shown in the class slides.

```{r}
# Take a random sample of rows
sampled <- sample_n(datmod, 40)

# All combinations of the sampled 'temp' and 'hum' values (Cartesian product)
temphum <- expand.grid(temperature = sampled$temp, humidity = sampled$hum)

# 'prob' holds the partial dependence: the average predicted count obtained
# when every row is given the same (temperature, humidity) pair
temphum$prob <- 0
for (i in seq_len(nrow(temphum))) {
  r <- datmod
  r[["temp"]] <- temphum[["temperature"]][i]
  r[["hum"]] <- temphum[["humidity"]][i]
  pred <- predict(days_rf, r)
  temphum[["prob"]][i] <- mean(pred)
}

# Tiles show the average prediction; the rugs show the density of each feature
ggplot(temphum, aes(x = temperature, y = humidity)) +
  geom_tile(aes(fill = prob), width = 10, height = 10) +
  labs(x = "Temperature", y = "Humidity") +
  guides(fill = guide_legend(title = "Number of bikes")) +
  geom_rug()

```
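
The loop above computes, for each (temperature, humidity) pair, the average prediction over all rows. As a rough sketch, the same idea can be wrapped in a reusable function and evaluated on a regular grid of quantiles instead of a random sample. The toy data, the linear model and the helper `pd_2d` are assumptions for illustration only.

```r
# Hypothetical toy data with an additive effect of the features.
set.seed(2)
toy2 <- data.frame(a = runif(300, 0, 30), b = runif(300, 20, 90), c = rnorm(300))
toy2$y <- 100 * toy2$a - 20 * toy2$b + 5 * toy2$c
toy2_fit <- lm(y ~ a + b + c, data = toy2)

# Two-dimensional partial dependence: for every pair of grid values,
# overwrite both features in all rows, predict, and average.
pd_2d <- function(model, data, f1, f2, grid1, grid2) {
  grid <- expand.grid(v1 = grid1, v2 = grid2)
  grid$yhat <- mapply(function(v1, v2) {
    data[[f1]] <- v1
    data[[f2]] <- v2
    mean(predict(model, newdata = data))
  }, grid$v1, grid$v2)
  names(grid)[1:2] <- c(f1, f2)
  grid
}

# Grid built from quantiles of each feature, so grid points follow the data.
grid_a <- unname(quantile(toy2$a, probs = seq(0.1, 0.9, by = 0.2)))
grid_b <- unname(quantile(toy2$b, probs = seq(0.1, 0.9, by = 0.2)))
pd_ab <- pd_2d(toy2_fit, toy2, "a", "b", grid_a, grid_b)
print(head(pd_ab))
```

A quantile grid keeps the number of predictions small (here 5 x 5 pairs) and avoids grid points in regions with no data, at the cost of unevenly spaced tiles.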

**QUESTION:**

To analyse the two-dimensional PDP we must read the density marks of each attribute (the rugs on their respective axes) together with the legend, whose colour intensity indicates the predicted number of rented bicycles. Lighter blue areas indicate a higher predicted value, and darker tones correspond to lower predicted values.

We focus on the ranges of values for which general interpretations of the predictive model can be drawn. For temperature (from 4 ºC to 28 ºC) and humidity (from 44% to 81%) we observe nine clearly differentiated zones in terms of tone.

The first zone (temperature: 4-8 ºC, humidity: 75-81%) has the lowest predicted number of rented bicycles (between 3000 and 3500), exactly where low temperature and high humidity meet.

In the last zone (temperature: 17-28 ºC, humidity: 44-61%), delimited by high temperature and low humidity, the highest number of rented bicycles is predicted (approximately 5000).

Both extremes are consistent with the direct relationship between the response variable and 'temp' and with its inverse relationship with 'hum'.

### **3.- PDP to explain the price of a house.**

Apply the previous concepts to predict the **price** of a house from the database **kc_house_data.csv**. In this case, again use a random forest approximation for the prediction based on the features **bedrooms**, **bathrooms**, **sqft_living**, **sqft_lot**, **floors** and **yr_built**.

Use the partial dependence plot to visualize the relationships the model learned.

```{r}
# Take a random 20% sample of the rows and keep the model variables
sample_house <- sample_frac(house, 0.2)
sample_house <- select(sample_house, bedrooms, bathrooms, sqft_living, sqft_lot, floors, yr_built, price)

# Train the model
house_rf <- randomForest(price ~ ., data = sample_house, importance = TRUE)

p1 <- partial(house_rf, pred.var = "bedrooms", plot = TRUE, rug = TRUE, plot.engine = "ggplot2")
p2 <- partial(house_rf, pred.var = "bathrooms", plot = TRUE, rug = TRUE, plot.engine = "ggplot2")
p3 <- partial(house_rf, pred.var = "sqft_living", plot = TRUE, rug = TRUE, plot.engine = "ggplot2")
p4 <- partial(house_rf, pred.var = "floors", plot = TRUE, rug = TRUE, plot.engine = "ggplot2")

subplot(p1,p2,p3,p4, shareX = FALSE, titleX = TRUE)

```

**QUESTION:**

***Analyse the influence of bedrooms, bathrooms, sqft_living and floors on the predicted price.***

For the attribute **'bedrooms'**, the density of the data only allows us to draw conclusions from the PDP in the range of 1 to 4 bedrooms. We observe two different trends: the predicted price rises from 1 to 2 bedrooms (from \$572,000 to \$581,000) and then falls from 2 to 4 bedrooms (from \$581,000 to \$542,000). One possible reading is that two-bedroom houses are the most in demand and therefore the most expensive. Larger bedroom counts are less usual, which we interpret as the reason for the decreasing trend the model shows above 2 bedrooms.

For the variable **'bathrooms'** we observe a positive relationship with the predicted price: a house with more bathrooms is normally considered more luxurious and therefore more expensive. The increase in price from 3 to 4 bathrooms is particularly clear. Valid conclusions can only be drawn for dwellings with 0 to 4 bathrooms, as we do not have enough data recorded for dwellings with more bathrooms.

For the variable **'sqft_living'**, the relationship is similar to that of bathrooms: a larger dwelling is more expensive, as it offers more living space. We have data up to about 5000 square feet; for dwellings larger than this we do not have enough data recorded and therefore cannot draw valid conclusions.

In the last PDP graph, for the variable **'floors'**, the trend is increasing: the predicted price rises when going from one floor to two and, above all, when the house has a third floor. The difference between one and two floors is comparatively small (from \$540,000 to \$560,000). Taking into account the demographic and climatological characteristics of the area (short, hot and dry summers and cold and wet winters), the two-floor house is usually the most demanded, with the living areas and kitchen on one floor and the bedrooms on the upper floor. A third floor is already a big budget deviation and marks an exceptional house.
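
The caveats above about sparsely populated ranges can be checked numerically before reading a PDP. The following is a rough sketch on simulated data: the data frame `sim_house`, its distributions and the 5%/95% cut-offs are assumptions for illustration, not values from kc_house_data.csv. It counts the rows behind each level of a discrete feature and reports the range holding most of a continuous one.

```r
# Simulated stand-in for two house features (not the real data).
set.seed(3)
sim_house <- data.frame(
  bathrooms   = sample(c(1, 2, 3, 4, 6), 500, replace = TRUE,
                       prob = c(0.30, 0.45, 0.18, 0.06, 0.01)),
  sqft_living = round(rlnorm(500, meanlog = 7.5, sdlog = 0.4))
)

# Discrete feature: number of rows behind each PDP point.
support_counts <- table(sim_house$bathrooms)
print(support_counts)

# Continuous feature: range that contains the central 90% of the data.
well_supported <- quantile(sim_house$sqft_living, probs = c(0.05, 0.95))
print(well_supported)

# Share of rows outside that range (at most 10% by construction).
outside <- mean(sim_house$sqft_living < well_supported[1] |
                sim_house$sqft_living > well_supported[2])
print(outside)
```

Levels with only a handful of rows, or values outside the well-supported range, are the parts of a PDP where the curve reflects the model's extrapolation more than the data.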
